ABSTRACT

Specific to Math Information Retrieval is the combination of text with
mathematical formulae, both in documents and in queries. A rigorous
evaluation of query expansion and merging strategies combining
math and standard textual keyword terms in a query is given. It is
shown that techniques similar to those known from textual query
processing may be applied in math information retrieval as well,
and lead to cutting-edge performance. Striping and merging partial
results from subqueries is one technique that improves results
measured by information retrieval evaluation metrics like Bpref.

Categories and Subject Descriptors 

H.3.3 [Information Systems]: Information Storage and Retrieval -
Information Search and Retrieval; I.7 [Computing Methodologies]:
Document and Text Processing - Index Generation

General Terms 

Algorithms, Design, Experimentation, Performance

Keywords 

query reformulation, query expansion, digital mathematical libraries, 
math indexing and retrieval, ranking 

1. MOTIVATION

There are about 350,000,000 formulae in 1,000,000 papers in
arXiv.org to be indexed and searched in addition to a keyword-based
full-text search. Processing of structured objects like mathematical
formulae is not yet supported in production IR systems. The first
deployed Math Information Retrieval (MIR) system that allowed
searching formulae was the system [3] used in the European Digital
Mathematical Library EuDML. Math-aware search is now planned
on Wikipedia and arXiv.org. For rigorous evaluation of existing MIR
system prototypes, new Math Tasks have been set up at the NTCIR-10
and NTCIR-11 conferences [3]. There are now datasets and query
relevance assessments available, allowing the MIR research community
to rigorously evaluate available systems and their ranking strategies.

In this paper we open a new area of research related to query
relaxation based on combining math and text keywords as well as
merging results of relaxed subqueries. We evaluate different
strategies in detail using datasets from the NTCIR-11 Math-2 Task.

The combination of multiple formulae and multiple text keywords
in one query, as used in the NTCIR-11 Math-2 Task, seems to be more
consistent with the real situation of a human using both textual
keywords and math formulae to express search intent. Math formulae
are a means of allowing the user to filter relevant documents out
of the entire database. They are complementary to the textual
keywords, not the sole way of expressing the search intent.

The MIaS system [6] supports this kind of query natively.
All the keywords are posted to the system in one text field: the
formulae are written in MathML or TeX notation, with dollar signs ($)
added on both sides of the TeX formulae. Formulae and text
keywords are separated by a single space. The keywords, sometimes
consisting of more than one word, are enclosed in quotation marks
(") so that multi-word keywords are handled as a single entity.
For the experiments described in this paper we use the open source
system MIaS and NTCIR-11 data.
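
The query format described above can be illustrated with a minimal sketch. The function name `build_query` is hypothetical and not part of MIaS; the sketch assumes TeX formulae, and placing formulae before keywords is an assumption made only for readability.

```python
from typing import Sequence


def build_query(tex_formulae: Sequence[str], keywords: Sequence[str]) -> str:
    """Build a single-field query string (hypothetical helper, not a MIaS API)."""
    # TeX formulae get a dollar sign on both sides.
    math_terms = [f"${formula}$" for formula in tex_formulae]
    # Every keyword is quoted so that multi-word keywords stay one entity.
    text_terms = [f'"{keyword}"' for keyword in keywords]
    # Terms are separated by a single space.
    return " ".join(math_terms + text_terms)


query = build_query(["x^2 + y^2 = z^2"], ["Pythagorean theorem", "triangle"])
print(query)
```

Quoting every keyword, not only the multi-word ones, keeps the sketch simple; whether single-word keywords need quotes is not stated in the text.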

MIaS is a full-text based search system with an extension for
processing mathematical expressions. The formulae from documents
and queries are canonicalized, expanded to generalized forms
to allow similarity matching, weighted and translated to a linear form
to be stored in a full-text index. Documents are ranked with a
modified TF-IDF formula that considers the similarity of the matched
formulae.

2. COMPLEX QUERY RELAXATION 

To increase the recall of not very successful queries as well as the
overall precision, query expansion and resubmission is a useful
technique. When a user posts a query that finds no (or very few)
results, the query can be modified or relaxed and the search run
again in order to give the user at least some results, albeit with a
lower score. A method to expand a query to multiple queries, where
each query is a subset of the original query consisting of mathematical
and textual terms, has proven to be very helpful [5]. This was
the first experiment in this direction in MIR.

Two types of query relaxation are possible. One way is to reduce
the number of terms if the query consists of more than one term.
A combination of reduced terms needs to be selected, especially if
the query consists of text as well as math terms. More query term
combinations can be run through the system one after another. The
important step is then an effective algorithm for merging result lists
with an appropriate weighting. The basic rule for weighted merging
is that the more reduced a query, the lower the score its individual
results should get.

Another type of query relaxation is mathematical expression
relaxation. If a query expression is an actual formula with an equals
sign, the expression can be split into the left and right sides of the
equals sign. These expressions can then form a new query. If the system
supports expressions with wildcards, queries could be relaxed by
inserting these automatically. We experimented with the reduction
of the number of terms in the individual text and math parts of the
queries.

3. LRO QUERY EXPANSION 

In our approach, the original query consisting of k keywords and
l formulae is used to generate a set of ‘subqueries’. At first, the
original query is used. Then subqueries are generated by removing
the keywords from the query one by one until the query consists of
the l formulae only. The rest of the subqueries are generated with all
the keywords and with formulae removed one by one until a query
consisting of the k keywords only is reached. We call this expansion
method LRO (Leave Rightmost Out).
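
The LRO generation order can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the MIaS implementation: the function name `lro_subqueries` is hypothetical, terms are plain strings, and the filter that skips an empty subquery (which can only arise when the original query has no keywords or no formulae) is our own addition.

```python
from typing import List, Sequence, Tuple

Subquery = Tuple[Tuple[str, ...], Tuple[str, ...]]  # (formulae, keywords)


def lro_subqueries(formulae: Sequence[str], keywords: Sequence[str]) -> List[Subquery]:
    """Generate Leave-Rightmost-Out subqueries (hypothetical helper)."""
    f, k = tuple(formulae), tuple(keywords)
    # Original query first, then drop the rightmost keyword one at a time
    # until only the formulae remain.
    candidates = [(f, k[:n]) for n in range(len(k), -1, -1)]
    # Then keep all keywords and drop the rightmost formula one at a time
    # until only the keywords remain.
    candidates += [(f[:m], k) for m in range(len(f) - 1, -1, -1)]
    # Assumption: an empty subquery is never submitted.
    return [(fs, ks) for fs, ks in candidates if fs or ks]


for number, (fs, ks) in enumerate(lro_subqueries(["f1", "f2"], ["k1", "k2", "k3"]), 1):
    print(f"subquery {number}:", " ".join(fs + ks))
```

For two formulae and three keywords the sketch yields six subqueries, from the full query down to the keywords-only query.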

An example of the complete ‘subqueries’ generation sequence for
a query consisting of two formulae and three keywords is shown in
Example 1.

    subquery 1 (the original query):  f_1 f_2 k_1 k_2 k_3
    subquery 2:                       f_1 f_2 k_1 k_2
    subquery 3:                       f_1 f_2 k_1
    subquery 4:                       f_1 f_2
    subquery 5:                       f_1 k_1 k_2 k_3
    subquery 6:                       k_1 k_2 k_3

Example 1: Complete sequence of subqueries derived from the
original user’s query

[Figure 1: Relative number of results found using different
subqueries for every query in the LRO CMath run. The plot shows the
percentage of results returned by individual subqueries.]

This kind of query expansion provides users with results for more
general queries than the one originally posted. We consider this
behavior useful especially for a ‘research’ search, as it shows the
user a wider context of the query that could possibly reveal new and
unexpected connections and paths to follow in the research.

The subqueries are submitted to the system one by one and
the partial result lists are merged (see the next section) into the
final list that is presented to the user.

The statistics of the relative number of results found using each
of the subqueries in the CMath run (see Table 1) are shown in Figure 1.
Every subquery was limited to at most 1,000 results as requested
in the NTCIR-11 Math Task. The graph shows that the use of
the original unmodified query usually resulted in far fewer than the
requested 1,000 results. The use of the results of multiple subqueries
thus provides significantly more results that are (at least partially)
relevant to the original topics.

Please note that the last subquery does not contain any formulae,
i.e. subquery 6 in Example 1 is a standard full-text keyword
query with no involvement of mathematical elements whatsoever.

Please also note that this algorithm does not cover all the possible
combinations of keywords and formulae and that it ‘unreasonably’
handles different formulae differently: in Example 1, formula f_1 is
used in five subqueries in contrast to the four uses of f_2, with no
reason to prefer f_1 over f_2. This simplification was used to keep the
number of subqueries small enough to reach an acceptable response
time even for interactive real users, as the total number of subqueries
would increase rapidly with the number of formulae and keywords
in the query if all their possible combinations were used.

The cumulative total MIaS search time for all 50 queries in the CMath
run was 10.81 seconds. Cumulative totals for the other three runs are
comparable: PMath 12.01 s, PCMath 14.70 s and TeX 19.83 s.


4. MERGING OF RESULTS 

The next important step of the query expansion and resubmission
procedure is the merging of result lists with an appropriate weighting.
In conjunction with the LRO expansion method we use a method we
refer to as ‘strip-merging’.

Every subquery results in an ordered list of items with a score¹
assigned to each of the results. However, these scores are only
comparable within the context of their result list. That means that a
result r_1 with a score of 0.25 from subquery 1 is not necessarily
more relevant than a result r_2 with a score of 0.15 from subquery 2,
even though 0.25 > 0.15, as absolute scores are incomparable across
the result lists of different subqueries. Thus, it is not possible to
generate the final result list as a simple combination of results
from all the subqueries ordered by score.

Another reason to use a more sophisticated results merging
procedure is that the results for the original query should be preferred
to the results found for subqueries. On the other hand, it is very
possible that the first result of a subquery could be more relevant for
the user than the 10th result of the original query.

To produce the final result list from the subqueries according to
this hypothesis, we used a method we refer to as ‘strip-merging’ of
the results. The main idea is to interleave ‘strips’ of hits from
all the ordered result lists of the subqueries. The less a subquery is
modified relative to the original query, the ‘wider’ the strip of hits
taken from it and the higher its position in the final result list.

¹ Measure of relevance to the query.

Let us have x subqueries (the original one and x - 1 derived
subqueries). The top x most relevant results in the final result list
are the first x most relevant results from the original query result list,
then the x - 1 most relevant results from the first derived subquery are
added, then x - 2 results from the second subquery and so on until
the first most relevant result from the last derived subquery is added.
This procedure is then repeated with the next x results from the
result list of the original query, x - 1 results from the first subquery
etc. until the desired number of results is reached. If all the results
from a subquery are used and there are no more left, we continue
without changing the width of the strips for the other subqueries.
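
The strip-merging procedure can be sketched as follows. This is a minimal illustration, not the MIaS code: the function name `strip_merge` and the `limit` parameter are hypothetical, results are represented by document identifiers, and the sketch assumes that a document returned by several subqueries is simply kept each time, since the handling of duplicates is not specified in the text.

```python
from typing import List, Sequence


def strip_merge(result_lists: Sequence[Sequence[str]], limit: int = 1000) -> List[str]:
    """Interleave strips of hits from ordered result lists (hypothetical helper).

    result_lists[0] belongs to the original query, result_lists[i] to the
    i-th derived subquery; each list is already ordered by relevance.
    """
    x = len(result_lists)
    positions = [0] * x  # how many hits were already taken from each list
    merged: List[str] = []
    while len(merged) < limit:
        progressed = False
        for i, results in enumerate(result_lists):
            width = x - i  # x hits from the original query, 1 from the last subquery
            strip = results[positions[i]:positions[i] + width]
            positions[i] += len(strip)
            merged.extend(strip)
            progressed = progressed or bool(strip)
            if len(merged) >= limit:
                break
        if not progressed:  # every list is exhausted
            break
    return merged[:limit]


lists = [["a1", "a2", "a3", "a4"], ["b1", "b2", "b3"], ["c1", "c2"]]
print(strip_merge(lists))
```

An exhausted list contributes an empty strip, so the widths used for the remaining lists stay unchanged, as the text requires.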

5. OTHER QUERYING STRATEGIES AND 
RESULT MERGING 

Original Query Only: OQO. 

The basic reference querying strategy is the use of the original
query without any modifications or derived subqueries. The results
found for the original query form the final list of results returned
to the user.

Math Terms Only: MTO. 

The Math Terms Only querying strategy is a simple modification of the
Original Query Only strategy: the query consists of the formulae from
the original query, and all the text keywords are removed from the query.

Text Terms Only: TTO. 

In the Text Terms Only strategy the query consists of only the text
keywords from the original query.

All Possible Subqueries: APS. 

The opposite extreme to using only the original query is to
use all the possible subqueries derivable from the original query.
Provided the original query consists of x formulae and y text
keywords, all the possible combinations of formulae f_1, ..., f_x and
text keywords k_1, ..., k_y provide us with 2^(x+y) - 1 non-empty
subqueries (including the original query itself).

Every subquery can be easily identified by a ‘bit mask’ representing
the inclusion/exclusion of particular components of the original
query. For example, subquery 5 in Example 1 can be represented
with mask 10-111.

The subquery mask can also be used to express the importance and
degree of modification of the particular subquery in contrast to the
original query. We call this number the ‘mask weight’ and it is
defined as

    $\text{mask weight} = 2 \sum_x f_x + \sum_y k_y$,

where f_x is the value of the x-th bit in the formulae part of the mask
and k_y the value of the y-th bit in the keywords part. The value of
the formula bit is multiplied by two to increase the importance of
subqueries with math components.

In the All Possible Subqueries querying strategy the final list of
results is built up from the results of particular subqueries as follows:

1. Lists of results from all the subqueries are ordered according
to their mask weights.

2. Let w_s be the mask weight of the subquery s. For every subquery
in the ordered subquery list remove the w_s top results from the
s-th query result list and put them into the final result list.

3. Repeat Step 2 until all the results have been moved to the final
result list or the desired number² of results in the final result list
is reached.
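
The bit masks, the mask weight and the three merging steps above can be sketched as follows. The names `all_masks`, `mask_weight` and `aps_merge` are hypothetical, and the sketch assumes that result lists are processed from the highest mask weight to the lowest, which the text implies but does not state explicitly. Lists with equal mask weight keep their input order.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

Mask = Tuple[Tuple[int, ...], Tuple[int, ...]]  # (formula bits, keyword bits)


def all_masks(x: int, y: int) -> List[Mask]:
    """All 2^(x+y) - 1 non-empty masks for x formulae and y keywords."""
    return [(bits[:x], bits[x:]) for bits in product((1, 0), repeat=x + y) if any(bits)]


def mask_weight(formula_bits: Sequence[int], keyword_bits: Sequence[int]) -> int:
    """Formula bits count twice, keyword bits once."""
    return 2 * sum(formula_bits) + sum(keyword_bits)


def aps_merge(results_by_mask: Dict[Mask, Sequence[str]], limit: int = 1000) -> List[str]:
    """Merge result lists in strips whose width is the mask weight."""
    # Step 1: order the lists by mask weight (assumed: heaviest first).
    ordered = sorted(results_by_mask.items(), key=lambda item: mask_weight(*item[0]), reverse=True)
    queues = [(mask_weight(*mask), list(results)) for mask, results in ordered]
    final: List[str] = []
    # Step 3: repeat until everything is moved or the limit is reached.
    while len(final) < limit and any(queue for _, queue in queues):
        # Step 2: move the top w_s results of every list to the final list.
        for weight, queue in queues:
            final.extend(queue[:weight])
            del queue[:weight]
    return final[:limit]


print(len(all_masks(2, 3)), mask_weight((1, 0), (1, 1, 1)))
```

For the query of Example 1 the sketch enumerates 31 masks, and mask 10-111 (subquery 5) gets weight 2 * 1 + 3 = 5, compared with 7 for the original query.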

² We put up to 1,000 results in every final result list.

Leave One Out: LOO.

The Leave One Out querying strategy is similar to the All Possible
Subqueries strategy with the following differences:

• We work with a restricted set of the subqueries: only the original
query and derived subqueries with exactly one component
(one formula or one text keyword) excluded are used.

• In Step 2 of the merging algorithm we do not use the mask weight
as the ‘strip-weight’. The strip-weight is 2 if taking results
from the original query result list, and 1 otherwise.

Please note that the ordering of the result lists of subqueries with
equal mask weight is implementation dependent and not defined.

Leave One or Two Out: LOoTO. 

The Leave One or Two Out querying strategy is a further extension
of the Leave One Out strategy:

• The set of the subqueries consists of the original query and
derived subqueries with exactly one or two components
excluded.

• The strip-weight is 3 if taking results from the original query
result list, 2 if taking results from a derived query with
exactly one component excluded, and 1 otherwise.

Once again, the ordering of the result lists of subqueries with
equal mask weight is implementation dependent and not defined.
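
The subquery sets and strip-weights of the LOO and LOoTO strategies can be generated by one small function. This is a sketch under our own formulation: the name `leave_out_subqueries` is hypothetical, and the rule "strip-weight = maximum number of exclusions + 1 - number of excluded components" is only a compact way of reproducing the weights 2/1 (LOO) and 3/2/1 (LOoTO) given in the text.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def leave_out_subqueries(
    components: Sequence[str], max_excluded: int
) -> List[Tuple[Tuple[str, ...], int]]:
    """Return (subquery, strip_weight) pairs (hypothetical helper).

    max_excluded = 1 corresponds to LOO, max_excluded = 2 to LOoTO.
    """
    subqueries = []
    for excluded in range(max_excluded + 1):
        if excluded >= len(components):
            break  # assumption: an empty subquery is never generated
        strip_weight = max_excluded + 1 - excluded
        for dropped in combinations(range(len(components)), excluded):
            kept = tuple(c for i, c in enumerate(components) if i not in dropped)
            subqueries.append((kept, strip_weight))
    return subqueries


query = ["f1", "f2", "k1", "k2", "k3"]
print(len(leave_out_subqueries(query, 1)), len(leave_out_subqueries(query, 2)))
```

For a query with five components this gives 6 subqueries for LOO and 16 for LOoTO, compared with 31 for APS.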

6. EVALUATION 

We evaluated the strategies using the NTCIR-11 Math-2 Task
collection of documents and the relevance judgements provided by the
conference organizers [1]. The document collection consists of
105,120 scientific documents from the arXiv pre-print archive. The
documents were divided into 8,301,578 paragraph units. The whole
collection contains 59,647,566 mathematical expressions. There are
50 topics (queries) consisting of one or more math expressions as
well as one or more textual terms. The judged pool consisted of
2,501 relevance assessments, ranked from 0 to 4.

In our evaluation we only used binary relevance judgements:
rank 0 was treated as non-relevant and ranks 1-4 as relevant (i.e.
including partially relevant documents), according to the original
NTCIR-11 evaluation.

As the evaluation tool we used a modified version of Terrier’s
evaluation tool. The modification consists in the added computation
of the Bpref metric. Bpref is supposed to be more precise than
MAP when the judged pool is far from complete [2], which is the
case in our situation because we use the NTCIR-11 relevance
assessments.

We evaluated the performance of different query expansion methods
connected with the different result merging methods described in
Section 5. The results are summarized in Table 1. As the baseline we
consider the OQO column in Table 1, as this is the current state of the
art in query expansion in most MIR systems.

Two sets of runs were evaluated. They differed in the math notation
that was used for mathematical expressions in queries. Content
MathML was used for queries in CMath runs and Presentation
MathML in PMath runs. We used these two notations in queries to
see whether they have any impact on the usefulness of individual
query expansion methods.

In addition to Bpref, as effectiveness metrics we have used Precision
at 1, 5, 10 (P@1, P@5, P@10) and Mean Average Precision
(MAP) as they are known in the IR community.
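
For orientation, the binarization of the graded judgements and two of the metrics can be sketched as follows. All function names are hypothetical. The Bpref variant shown is the commonly cited definition by Buckley and Voorhees, bpref = (1/R) Σ_r (1 - |n ranked higher than r| / R), where R is the number of judged relevant documents and only the first R judged non-relevant documents are counted; the modified Terrier tool used for the experiments may differ in details, and unjudged documents are ignored.

```python
from typing import Dict, Sequence, Set, Tuple


def binarize(judgements: Dict[str, int]) -> Tuple[Set[str], Set[str]]:
    """Grade 0 is non-relevant, grades 1-4 are relevant."""
    relevant = {doc for doc, grade in judgements.items() if grade >= 1}
    nonrelevant = {doc for doc, grade in judgements.items() if grade == 0}
    return relevant, nonrelevant


def precision_at_k(ranking: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k results that are judged relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def bpref(ranking: Sequence[str], relevant: Set[str], nonrelevant: Set[str]) -> float:
    """Bpref for one topic (sketch of the Buckley-Voorhees definition)."""
    r_count = len(relevant)
    if r_count == 0:
        return 0.0
    nonrelevant_seen = 0
    total = 0.0
    for doc in ranking:
        if doc in relevant:
            # Penalize by the judged non-relevant documents ranked above.
            total += 1.0 - min(nonrelevant_seen, r_count) / r_count
        elif doc in nonrelevant:
            nonrelevant_seen += 1
        # Unjudged documents do not affect the score.
    return total / r_count


relevant, nonrelevant = binarize({"d1": 4, "d2": 0, "d4": 1, "d5": 0})
print(bpref(["d1", "d2", "d3", "d4"], relevant, nonrelevant))
```

The reported values would then be averages of such per-topic scores over the 50 topics.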

7. SUMMARY AND CONCLUSIONS 

Our experiments have shown the importance of query reduction
and results slicing/merging techniques in a MIR system like MIaS
with mixed query sections containing multiple math tokens as well
as multiple text tokens at the same time. We use the logical AND
operator between the text keywords group and the math formulae group,
aiming for better precision by narrowing down the result set of one
group with the other.

Table 1: Evaluation metrics for CMath and PMath runs. Values are
averaged over 50 NTCIR queries/topics. The strategies are described
in Sections 3, 4 and 5; OQO is considered the baseline. The best value
of each metric across the strategies is marked with an asterisk (*).

metric     run     OQO      MTO      TTO      LOO      LOoTO    APS      LRO
Bpref      CMath   0.2544   0.2673   0.3739   0.4623   0.4636   0.4653   0.4734*
Bpref      PMath   0.2496   0.2694   0.3739   0.448    0.449    0.449    0.4547*
MAP avg    CMath   0.087    0.0879   0.1387   0.168    0.1479   0.1432   0.2152*
MAP avg    PMath   0.0704   0.0719   0.1387   0.1502   0.1315   0.1252   0.1943*
P@1 avg    CMath   0.6667   0.6207   0.72     0.68     0.62     0.58     0.96*
P@1 avg    PMath   0.6538   0.6      0.72     0.64     0.58     0.54     0.94*
P@5 avg    CMath   0.4133   0.3793   0.604    0.628    0.52     0.516    0.872*
P@5 avg    PMath   0.3462   0.32     0.604    0.6      0.484    0.456    0.848*
P@10 avg   CMath   0.27     0.2759   0.35     0.432    0.412    0.368    0.546*
P@10 avg   PMath   0.2308   0.228    0.35     0.406    0.384    0.34     0.506*

The importance of expansion is underpinned by the evaluation
results of the baseline OQO run against all the runs using query
expansion (LRO, LOO, LOoTO, APS). Both Bpref and MAP of the OQO
run are considerably lower than those of any of these runs.

The power of the individual parts of the query, i.e. the math and
text parts, is shown in the MTO and TTO runs. Each part on its own
achieves a higher Bpref and MAP than the baseline OQO run. This
indicates that the original topic formulation OQO is too restrictive.

Of the runs that used query expansion/results merging, the LRO
run performed the best. It prefers the math part of the query over the
text part. However, from the TTO run we see that text terms alone
retrieve more relevant results than OQO, and the LRO run covers
these results as well. This helps when all the math terms in the query
fail, for instance due to the high complexity of the formula.

As described in Section 3, subqueries are constructed by removing
the last keyword/formula one at a time. This may lead to a suspicion
that the success of the LRO run resides in the formulation of the
original query: if the terms in the original query were ordered
by their significance, i.e. removing the last keyword means removing
the least important keyword, which results in a more specific query,
then differently formulated queries (i.e. with permuted keywords) would
fail with this strategy. To verify this hypothesis, we created reversed
original queries. The order of the keywords and formulae was
reversed in their respective query groups. The results of these queries
with the LRO strategy were roughly the same as for the non-reversed
original queries. This disproves our hypothesis and means that the LRO
strategy used with query tokens ordered by their specificity gives
the best results. Other expansion/merging methods yielded slightly
worse evaluation results.

It is hard to decide whether leaving out one or two text and/or math
tokens helps the query performance. It is heavily dependent on the
actual terms in the queries and their restrictiveness.

PMath runs show similar results with slightly lower overall scores.
This is caused by the less precise Presentation MathML query
formulae, which may contain semantically less important markup that
may lead to a mismatch between the query expression and those found
in documents.


Paying attention to query reduction and results slicing is of utmost 
importance in MIR. Content MathML gives slightly better results 
than Presentation MathML and helps to narrow a semantic gap. 